# simulating flipping 50 coins:
coinflips <- sample(c("heads","tails"), size=50, replace=TRUE)
# counting the number of heads
prop_heads <- mean(coinflips == 'heads')
Inference
Reminders
You should have received an email about signing up for the GVPT experimental lab
Check the exam 1 module for a list of things to know for exam 1. Be ready to ask about them on Monday.
For today
We’ll start on introducing material from chapter 6. This will not be on the upcoming midterm, but we’ll revisit it after the break
We’ll talk about some math, but ultimately we don’t need a lot beyond square roots to understand what’s going on. The more important part is the intuition.
How do we calculate a margin of error for a survey?
Margin of error and surveys
Note: the general principles here apply to analyses of non-survey data as well, but we’ll talk in terms of surveys because that tends to be a bit more familiar for people (or at least gov majors)
Sampling
Remember that:
We want to talk about the population: the beliefs and behaviors of all eligible voters, or all countries, or all adults. We’ll call characteristics of this group (mean, standard deviation, median etc) population parameters
But we typically only have data on a sample from the population. We’ll call characteristics of this group sample statistics
Sampling
We don’t expect a sample mean to exactly equal a population mean, but we do expect it to have some random error.
Sample Statistic = Population Parameter \(\pm\) Random Error
We don’t know the actual size of the error for a given statistic, but we can estimate the probability of an error greater than some value.
Uncertainty and sampling
The general intuition:
more data = more certainty. Estimates from a really big representative sample are going to be more likely to approximate the true population mean.
less variance = more certainty. If people in the target population are pretty much the same, then it is more likely that a given sample will resemble the target population.
Margin of error
The margin of error is more frequently called a “confidence interval” in statistics.
A confidence interval is a range that we are reasonably certain contains the actual population value.
Specifically, the margin generally reported in a survey is a 95% confidence interval when there’s a 50/50 split in a dichotomous outcome.
If I estimate that 50% of respondents are Democrats with a margin of error of +/- 3% then I’m saying I’m 95% certain that the range 47-53% contains the true population % of Dems.
Margin of error = Confidence Interval
95% confidence means: “if I repeated this survey an infinite number of times and recalculated the CI for each one, 95% of my calculated confidence intervals would contain the actual population value.”
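A quick simulation can make this interpretation concrete. The sketch below is my own illustration (the sample size of 500, the seed, and the trial count are arbitrary assumptions): it repeats a 500-flip "survey" many times and checks how often the 95% interval captures the true proportion of .5.

```r
set.seed(1)  # arbitrary seed, just for reproducibility
covered <- replicate(5000, {
  flips <- sample(0:1, size = 500, replace = TRUE)  # true proportion = .5
  p_hat <- mean(flips)
  se    <- sqrt(p_hat * (1 - p_hat) / 500)
  # does this sample's 95% confidence interval contain the true value?
  (p_hat - 1.96 * se) <= 0.5 && 0.5 <= (p_hat + 1.96 * se)
})
mean(covered)  # roughly 0.95
```

About 95% of the intervals contain the true value, which is exactly what the "repeated surveys" definition promises.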
The Central Limit Theorem
As the number of repeated samples approaches infinity, the sampling distribution of the sample mean will converge toward a normal distribution centered on the population mean.
The sampling distribution of a sample mean
Coin flips
Think about flipping a coin:
the probability of landing on heads is 50% (this is our population parameter)
We’ll run an “experiment” where we flip a coin 50 times and calculate the proportion of heads.
- (remember that, for a dummy (0, 1) coded dichotomous outcome, the proportion = the mean)
Coin flips
This is hardly surprising!
the proportion of heads is 0.38.
The error is 0.38 - 0.5 = -0.12
. . .
I get closer to the correct answer when I increase the sample size, but I still will have some amount of error:
# simulating flipping 500 coins:
coinflips <- sample(c("heads","tails"), size=500, replace=TRUE)
# counting the number of heads
prop_heads <- mean(coinflips=='heads')
the proportion of heads is 0.51.
The error is 0.51 - 0.5 = 0.01
The sampling distribution of coin flips
Repeating this experiment 5,000 times will give us a sense of the sampling distribution for this sample statistic:
# 5000 "samples" of the coin-flipping experiment
trials<-replicate(5000, sample(c("heads","tails"), size=50, replace=TRUE)=="heads")
# get the average num of heads in each trial. Expected value = .5
mean_heads <- colMeans(trials)
# draw a histogram
hist(mean_heads, main='proportion of heads in 50 coin flips', breaks=30)
The sampling distribution of coin flips
- The samples are centered around the expected proportion of .5 heads
- More extreme errors are less common than small ones
- The results are symmetric: there are about as many high values as low ones.
- Error = Sample Mean - Population Mean, so the distribution of errors has this same shape, but it’s centered around zero.
The sampling distribution of coin flips
Coin flips technically follow a binomial distribution, but our sampling distribution looks a lot like a normal “bell curve” centered on .5
Sampling distributions of sample means
In fact, for a sufficient sample size, sampling distributions of sample means will always end up looking like this regardless of the actual population distribution.
The normal distribution
We know a lot about bell curves. If we have mean \(\mu\) and a standard deviation \(\sigma\), we can easily find how much data we expect to fall anywhere within this shape using the probability density function.
. . .
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{\!2}\,\right)\]
. . .
(Maybe not the kind of thing you want to calculate in your head, but pretty easy for a computer)
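As a quick check in R (the point `x = 1.5` and the standard-normal parameters are arbitrary choices for illustration), writing the formula out by hand gives the same answer as R’s built-in `dnorm()`:

```r
mu <- 0; sigma <- 1; x <- 1.5  # arbitrary values for illustration
# the probability density function, written out by hand:
by_hand <- (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma)^2)
# R's built-in normal density gives the same answer:
all.equal(by_hand, dnorm(x, mean = mu, sd = sigma))  # TRUE
```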
The normal distribution
The normal distribution
The mean (\(\mu\)) and standard deviation (\(\sigma\)) will depend on the data, but we can always:
. . .
- Mean center: subtract the mean \(\mu\) from each observation so that the new mean is zero.
. . .
- Standardize: divide each observation by the standard deviation of the sampling distribution (\(\sigma\)), so the new standard deviation is 1
. . .
We’ll call this new number the “Z-score”:
\[ \text{Z score} = \frac{\text{Deviation from the mean}}{\text{Standard Deviation}} \]
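In R, this standardization takes one line (the sample values below are made up for illustration; the built-in `scale()` function does the same thing):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up observations
# Z-score: deviation from the mean, divided by the standard deviation
z <- (x - mean(x)) / sd(x)
mean(z)  # 0 (up to rounding)
sd(z)    # 1
```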
The Standard Normal Distribution
The Central Limit Theorem
As the number of repeated samples approaches infinity, the sampling distribution of the mean will converge toward a normal distribution centered on the population mean.
. . .
This means that:
- The sampling errors will also have a normal distribution centered on zero.
- Sampling errors/means are normally distributed even if the population is non-normal.
- We can calculate the proportion of data that falls under different parts of the curve, provided we know \(\mu\) and \(\sigma\)
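For example, `pnorm()` gives the area under the standard normal curve to the left of a given Z-score, so the share of errors within \(\pm 1.96\) standard deviations of the mean is:

```r
# area under the standard normal curve between Z = -1.96 and Z = 1.96
pnorm(1.96) - pnorm(-1.96)  # about 0.95
```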
So what?
Using the CLT
We don’t need to take infinite samples to make use of this! We can draw a sample and assume that it comes from a distribution with this shape:
Using the CLT
\[f(x) = \frac{1}{\color{red}{\sigma}\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\color{red}{\mu}}{\color{red}{\sigma}}\right)^{\!2}\,\right)\]
Remember: we need to know \(\mu\) and \(\sigma\)
We don’t know our population mean \(\mu\), but we do know that the mean sampling error is zero, so we can just use 0 as our expected mean error.
However, we still need to know the standard deviation \(\sigma\)
Here, we’re looking for the standard deviation of the sampling errors. We’ll call this value our standard error or SE for short.
Finding this is a challenge, but we can make some simple assumptions here to get a “worst case scenario” for a survey
The standard error
Remember that more observations = more certainty. So we expect that the standard error will decrease when taking larger samples.
. . .
We actually know how much the SE shrinks for each observation:
\[ \text{Standard error (SE)} = \frac{\sigma}{\sqrt{n}} \]
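A simulation sketch (the sample sizes, seed, and trial count here are my own choices, reusing the coin-flip setup from earlier) shows this \(\sqrt{n}\) shrinkage in action:

```r
set.seed(42)  # arbitrary seed, just for reproducibility
# empirical standard error: the SD of sample means across many trials
se_for_n <- function(n) {
  sd(replicate(5000, mean(sample(0:1, size = n, replace = TRUE))))
}
se_for_n(50)   # close to sqrt(.25/50)  ~ 0.0707
se_for_n(500)  # close to sqrt(.25/500) ~ 0.0224
```

A sample ten times as large has a standard error about \(\sqrt{10} \approx 3.16\) times smaller.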
The standard error of a proportion
\[ \text{Standard error (SE)} = \frac{\sigma}{\sqrt{n}} \]
We don’t know \(\sigma\), but surveys are usually reporting a proportion or percentage of people giving a particular response. The standard error of a proportion actually depends on the proportion itself:
\[ \text{Standard Error of a proportion} = \frac{\sqrt{p\times(1-p)}}{\sqrt{n}} = \sqrt{\frac{p\times(1-p)}{n}} \]
\[ p = \text{the proportion} \]
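As a sketch in R (the sample size of 1,000 is an arbitrary choice for illustration):

```r
# standard error of a proportion p at sample size n
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

se_prop(0.5, 1000)  # 50/50 split: ~0.0158
se_prop(0.9, 1000)  # 90/10 split: ~0.0095 (smaller)
```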
The standard error of a proportion
Note that this part of the formula is at its maximum when p = .5 (i.e. when there’s a 50/50 split on a yes/no question)
\[ \text{Standard Error of a proportion} = \sqrt{\frac{\color{red}{p\times(1-p)}}{n}} \]
So, if we know our sample size, we can calculate the worst case scenario for the standard error where \(p = .5\)
Back to that margin
- We know our sample size is 2,091
- We know that the biggest SE possible is when \(p = .5\), \(\sqrt{\frac{.5(1-.5)}{n}}\), so the worst case scenario is \(\sqrt{\frac{.5(1-.5)}{2,091}} \approx 0.0109\)
- We also know that the errors are normally distributed and centered around 0
We have all the ingredients we need to apply the CLT here!
The Margin of error
Remember we want a 95% confidence level.
We already know that for this pdf, 95% of the errors will fall within approximately \(\pm 2\) standard errors of 0 (technically the Z-score is closer to 1.95996…)
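The exact cutoff comes from the normal quantile function; in R:

```r
# Z-score that leaves 2.5% in each tail, i.e. 95% in the middle
qnorm(0.975)  # 1.959964
```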
The Margin of error
\[ \text{Standard Error of a proportion} = \sqrt{\frac{p\times(1-p)}{n}} \]
\[ \text{SE for this sample at (p=0.5)} = \sqrt{\frac{.5(1-.5)}{2,091}} \approx 0.0109 \]
So now we just need:
\[ \text{standard error} \times \text{Z-score for a 95\% confidence level} = \]
\[ .0109 \times 1.96 \approx 0.0214 = 2.14\% \]
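The whole calculation in R, using the sample size of 2,091 from the example above:

```r
n   <- 2091
se  <- sqrt(0.5 * (1 - 0.5) / n)  # worst-case standard error
moe <- qnorm(0.975) * se          # margin of error at 95% confidence
round(100 * moe, 2)               # about 2.14 (percentage points)
```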
The Margin of error
So what is the margin of error on a poll?
It’s a 95% confidence interval for a worst-case scenario where there’s a 50/50 split on a question with two response categories.
The Central Limit Theorem
When I take a decent-sized random sample from the population:
I know the sample mean (\(\bar{x}\)) is drawn from a normal distribution.
I know the mean of this sampling distribution is equal to the population mean.
I know the standard deviation of this sampling distribution (a.k.a. the standard error) is equal to \(\frac{\sigma}{\sqrt{n}}\)
- So: I know that 95% of my sample averages will be within \(\approx 1.96\) standard errors of the correct answer (and any number of other probabilities based on the normal curve)
Calculating a 95% confidence interval
\[ \text{lower 95% confidence boundary} = \bar{x} - 1.96 \text{ standard errors} \]
\[ \text{upper 95% confidence boundary} = \bar{x} + 1.96 \text{ standard errors} \]
. . .
Sometimes we’ll just round this to “2” because it’s usually good enough.
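Putting it together in R with a hypothetical result (the sample proportion of 0.54 and the sample size of 2,091 are made-up values for illustration):

```r
xbar <- 0.54   # hypothetical sample proportion
n    <- 2091   # hypothetical sample size
se   <- sqrt(xbar * (1 - xbar) / n)
c(lower = xbar - 1.96 * se,
  upper = xbar + 1.96 * se)   # roughly 0.519 to 0.561
```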
Next questions:
The margin of error is a worst-case scenario. It’s convenient to have one number that can apply to the entire survey, but we can generally be more precise with our confidence intervals.
We often want to estimate confidence for things that aren’t proportions, so how do we do this without knowing the population value for \(\sigma\)?